Skip to content

[ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test#37616

Merged
noooop merged 4 commits intovllm-project:mainfrom
ROCm:akaratza_fixcohere_openai
Mar 25, 2026
Merged

[ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test#37616
noooop merged 4 commits intovllm-project:mainfrom
ROCm:akaratza_fixcohere_openai

Conversation

@AndreasKaratzas
Copy link
Copy Markdown
Collaborator

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@mergify mergify bot added the rocm Related to AMD ROCm label Mar 19, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 19, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a flaky test for Cohere/OpenAI embedding parity on ROCm by adding ROCM_EXTRA_ARGS to the test server's configuration. This introduces arguments to disable prefix caching and limit the maximum number of sequences to one on ROCm platforms. While this change successfully stabilizes the test, I have a concern that limiting sequences to one effectively disables batch processing, which undermines the purpose of the test_batch_parity test. My review includes a suggestion to handle this more explicitly to maintain test integrity.

@AndreasKaratzas
Copy link
Copy Markdown
Collaborator Author

Testing MI325 to see if issue is resolved (added rocm and ready labels).

@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 20, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@mergify mergify bot added the frontend label Mar 20, 2026
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review March 20, 2026 17:25
@AndreasKaratzas AndreasKaratzas requested a review from noooop as a code owner March 20, 2026 17:25
@AndreasKaratzas
Copy link
Copy Markdown
Collaborator Author

await self._prepare_generators(ctx)
await self._collect_batch(ctx)
try:
await self._collect_batch(ctx)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? We now use app level error handlers to convert error responses

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DarkLight1337 The app-level Exception handler at api_server.py:270 handles dimensions=-1 correctly (returns 400), but for the immediately following dimensions=16 request on the same connection, the same ValueError from pooling_params.verify() escapes the Starlette ExceptionMiddleware and crashes the ASGI app. The client gets APIConnectionError instead of BadRequestError.

https://buildkite.com/vllm/amd-ci/builds/6711/steps/canvas?sid=019d088b-f229-4de8-923d-b4c48a62c6fb&tab=output

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndreasKaratzas AndreasKaratzas requested a review from njhill as a code owner March 25, 2026 09:32
Copy link
Copy Markdown
Collaborator

@noooop noooop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@noooop noooop enabled auto-merge (squash) March 25, 2026 09:56
Comment on lines 129 to 131
for dimensions in [-1, 16]:
with pytest.raises(openai.BadRequestError):
await make_request_and_correctness_test(dimensions)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask why this test can pass on the NVIDIA GPU?

Or did this test not pass on the NVIDIA GPU?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding, NVIDIA GPUs should also suffer from the issue below.

but for the immediately following dimensions=16 request on the same connection, the same ValueError from pooling_params.verify() escapes the Starlette ExceptionMiddleware and crashes the ASGI app. The client gets APIConnectionError instead of BadRequestError.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason the observed issue surfaces more times on ROCm is likely timing. Slower engine processing widens the window where the async generator cleanup and the ServerErrorMiddleware re-raise interact with the keep-alive connection state. On NVIDIA the race window is narrower, so the test may pass consistently or fail only intermittently. That's what I think it's going on.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's eye-opening.

How were you able to spot and fix a race condition bug? LOL

Copy link
Copy Markdown
Collaborator Author

@AndreasKaratzas AndreasKaratzas Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh didn't spot it in an exact line inside our stack 😅 I mostly emphasize on the network sluggishness part of our infra these days so I am guessing this is what is going on.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @DarkLight1337 @andyxning

I have a bad feeling that

  • some exceptions won't be caught by the app-level error handlers?
  • Or will the async main loop catch exceptions from another coroutine?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noooop noooop merged commit f262a62 into vllm-project:main Mar 25, 2026
49 of 50 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 25, 2026
@AndreasKaratzas AndreasKaratzas deleted the akaratza_fixcohere_openai branch March 25, 2026 10:57
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Mar 27, 2026
malaiwah pushed a commit to malaiwah/vllm that referenced this pull request Mar 27, 2026
…t#37616)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…t#37616)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…t#37616)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
…t#37616)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…t#37616)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
bhargav-patel-29 pushed a commit to Bharatgen-Tech/vllm that referenced this pull request Apr 1, 2026
…t#37616)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org>
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026
rishitdholakia13 pushed a commit to rishitdholakia13/vllm that referenced this pull request Apr 7, 2026
…t#37616)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…t#37616)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
big-yellow-duck pushed a commit to EmbeddedLLM/vllm that referenced this pull request Apr 8, 2026
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants